Machine learning modeling of RNA structures: methods, challenges and future perspectives


Abstract

The three-dimensional structure of RNA molecules plays a critical role in a wide range of cellular processes, encompassing functions from riboswitches to epigenetic regulation. These RNA structures are highly dynamic and are more aptly described as an ensemble of structures whose distribution shifts under different cellular conditions. Thus, the computational prediction of RNA structure poses a unique challenge, even as computational protein folding has seen great advances. In this review, we focus on a variety of machine learning-based methods that have been developed to predict RNA molecules' secondary structures, as well as their more complex tertiary structures. We survey commonly used modeling strategies and how many are inspired by or incorporate thermodynamic principles. We discuss the shortcomings that various design decisions entail and propose future directions that could build on these methods to yield more robust, accurate RNA structure predictions.

Keywords: RNA structure prediction, RNA, secondary structure, tertiary structure, machine learning, deep learning, review

INTRODUCTION

Ribonucleic acid (RNA) is most canonically known as a messenger molecule that carries genetic information out of the nucleus to be expressed and translated into proteins, but these transcripts also perform a plethora of additional functions. These functions include regulating the expression and degradation of RNA transcripts, assisting in transporting proteins around the cell and even modifying deoxyribonucleic acid (DNA) chromatin regulation patterns [1]. Many works have linked RNAs’ diverse, critical functions to their folded structure [2–8].

Much like protein structures, RNA structures can be described at different hierarchical levels. At the most basic level is the primary structure, which simply encapsulates the sequence of adenine (A), guanine (G), cytosine (C) and uracil (U) nucleotides. The secondary structure captures the pairwise interactions between nucleotides in an RNA sequence. These commonly include structures like hairpins and bulges driven by Watson–Crick base pairing (i.e. A/U and C/G), but also encompass rarer but structurally and functionally critical non-canonical base pairing interactions [9–11] and pseudoknot structures, which form when nucleotides within a loop pair with nucleotides beyond the helices that close the loop [8, 12, 13]. Tertiary structure describes how secondary structure elements are arranged together in three-dimensional space and is typically driven by longer-range interactions. Quaternary structure describes how multiple RNA molecules, or heterogeneous combinations of nucleic acid and protein molecules, assemble into larger units. Although the structural vocabulary describing RNAs is very similar to that describing protein structures, there are several key differences in their folding behaviors. Most notably, for RNAs, the secondary structure tends to be more stable than the tertiary structure, whereas proteins tend to rely more on stabilizing tertiary interactions [14].

Despite the importance of both secondary and tertiary structures in understanding RNA function, it remains time-consuming and expensive to experimentally profile 3D RNA structures [15]. This has motivated a large body of literature that aims to computationally predict the folded structure of an RNA given its nucleotide, or primary, sequence. The success of these methods would eventually allow researchers to study and profile RNA molecules with much greater throughput and efficiency and could even aid in the development of computational methods that generate new RNAs with desirable, targeted structural properties. Methods predicting RNA structure can be broadly divided into three categories: thermodynamic approaches, evolutionary approaches and machine-learning approaches. Thermodynamic approaches try to estimate the structural configuration of an RNA molecule with the lowest energy, frequently via dynamic programming algorithms with thermodynamic scoring schemes. Examples of such works include RNAfold [16], MFold [17] and LinearFold [18]. Evolutionary approaches leverage patterns in the conservation and variation of different nucleotides to estimate an overall structure; examples of such works include TurboFold II [19], IPKnot [20], McCaskill-MEA [21] and PETFold [22]. Energetic and evolutionary approaches are also frequently integrated, as done in the RNA secondary structure prediction method PETFold [22] and tertiary structure prediction tool FARFAR2 [23]. Although thermodynamic and evolutionary approaches leverage different principles, they are both derived using expert knowledge to specify and implement a known function $f$ that calculates a predicted structure $y$ given input nucleotides $x$, i.e. $f(x) = y$.

Machine learning (ML) approaches deviate from this paradigm by using data to implicitly learn the function $f(x)$. Human experts do not define an explicit relationship between inputs and outputs, but rather specify a model architecture describing the types of relationships that can be expressed by the model. The parameters of that model are then learned by applying optimization techniques to data. For example, if an expert believes that an output should be linearly related to input variables, they may choose to train a linear regression model; a linear regression model has relatively few tunable parameters and thus can be effectively trained with only a few data points but cannot capture more complex nonlinear relationships. ML encompasses approaches like linear/logistic regression, support vector machines (SVMs), decision trees and k-nearest-neighbors classifiers; more recently, a subclass of ML called deep learning (DL) has gained popularity, enabled by advances in computational throughput and data availability. DL models leverage huge numbers of model parameters to approximate complex, high-dimensional, non-linear relationships [24] and have made substantial strides in computer vision, natural language processing (NLP), machine translation and more. However, DL models' high expressivity can also lead to overfitting—where a model perfectly recapitulates training examples but cannot generalize to new, unseen data.

In this work, we provide a review of ML (including DL) works that attempt to predict the secondary or tertiary structural configuration of an RNA strand given its primary sequence of nucleotides (or a primary sequence processed to include additional annotations). We do not focus on the rich body of work that attempts to predict structural configurations using thermodynamic or evolutionary approaches, which is discussed in other works [25–28]. Furthermore, although the quaternary structure is key to understanding many RNA interactions, most ML methods in this vein have focused on predicting whether and/or where an RNA interacts with another biomolecule, such as a protein, and less so on modeling the precise structural state of that interaction [29–31]; we thus omit these works from our structure-focused review. We first describe common datasets used for modeling RNA structure, as well as metrics commonly used to quantify performance on these datasets. Building off this, we review works in the literature that attempt to predict secondary structure, first covering classical, non-DL approaches and subsequently focusing on more recent efforts leveraging DL. We then review works predicting tertiary structures. We conclude by discussing the themes emerging across these works and potential avenues for extending these ideas to build more reliable ML predictors of RNA structure.

Data representations, datasets and common metrics

RNA structure can be broadly described in terms of primary, secondary, tertiary and quaternary structures. Primary structure (or primary sequence) is the sequence of nucleotides comprising an RNA molecule and is frequently simply represented as plain text.

RNA secondary structure describes which pairs of nucleotides in an RNA sequence are interacting. These pairings are typically experimentally obtained using footprinting or proximity ligation strategies [32]. Footprinting strategies modify RNA molecules in a structure-specific manner, and upon sequencing, these modifications reveal how likely each nucleotide is to be paired or unpaired; examples of such methods include DMS-MaPseq [33], icSHAPE [34], DMS-seq [35] and structure-seq [36]. These footprinting methods do not capture the interaction partner for each nucleotide. To obtain a secondary structure from such data, experimental footprinting data are used as pseudo-energies or constraints for computational methods like ViennaRNA [37], RNAsc [38] and DREEM [39], which attempt to estimate a secondary structure that minimizes energy estimates while satisfying experimental readouts. Unlike footprinting methods, proximity ligation methods induce cross-linking between interacting nucleotides to identify pairing partners upon downstream sequencing. Examples of ligation methods include PARIS [40], SPLASH [41], COMRADES [42], SHARC [43], hiCLIP [44] and RPL [45]. These methods typically require less complex computational post-processing to obtain secondary structures as they explicitly identify interaction partners.

In addition to these experimental protocols, RNA secondary structure can also be indirectly inferred with several methods. RNA secondary structures can be extracted from RNA tertiary structures (discussed in detail below). RNA secondary structure can also be deduced using comparative genomics, whereby sequence conservation and covariation are used to identify regions in the RNA sequence that are likely paired. Several databases, such as the comparative RNA website (CRW) [46] and Rfam [47] have applied this methodology, via computational packages like R-scape [48], to accurately profile structures for many well-conserved RNAs, often with manual curation efforts to ensure consistent, high-quality structures. Several works focus on the centralized curation of datasets of RNA secondary structures gathered using experimental and computational methods. The bpRNA project aggregates and annotates information from Rfam, CRW and other databases to build a large, centralized database of RNA secondary structures [49]. Other datasets have also been assembled and published by methods training models to predict RNA structure (Table 1).

Table 1. Common datasets of experimentally determined RNA structures; non-exhaustive. Of these, only PDB provides comprehensive 3D annotations. *Rfam, and datasets derived from it, define families based on sequence and structure conservation. This differs from ArchiveII or RNAStralign, which curate families by biological function (e.g., RNAStralign defines its 8 families as 5S ribosomal RNAs, group I introns, tmRNAs, tRNAs, 16S ribosomal RNAs, signal recognition particle RNAs, RNase P RNA and telomerase RNAs). Consequently, the family counts are not directly comparable.

| Dataset | Release year | Structural annotations | Number of structures/sequences | Number of families |
|---|---|---|---|---|
| RNAStralign [19] | 2017 | Secondary | 37 149 | 8 |
| ArchiveII [50] | 2016 | Secondary | 3975 | 10 |
| bpRNA-1m [49] | 2018 | Secondary | 102 318 | 2588* |
| bpRNA-new [51] | 2021 | Secondary | 5402 | 1500* (estimated) |
| Rfam [47] (release 14.9) | Ongoing | Secondary (with limited tertiary) | 90 190 | 4108* |
| PDB dataset [52] (as of Dec. 2022) | Ongoing | Tertiary | 1680 (RNA-only) | Not annotated |

Secondary structure can be represented most simply as per-nucleotide paired/unpaired annotations, or using more expressive dot-bracket notation (DBN) or pairwise contact maps (Figure 1A). DBN is a balanced-parentheses syntax that expresses whether each nucleotide is unpaired ('.'), paired to a nucleotide ahead of it ('(') or paired to a nucleotide behind it (')'). Pairwise contact maps are symmetric binary matrices where the $(i, j)$ cell indicates whether the corresponding $(i, j)$ nucleotides are interacting. Their binary nature distinguishes them from distance maps, which contain continuous values indicating the distance between pairs of atoms. These representations can all express canonical Watson–Crick base pairing schemes, as well as non-canonical pairings. Standard DBN cannot unambiguously express pseudoknots, but several works have extended DBN with additional bracket types ([], {}, etc.) to effectively express pseudoknots [49].
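For concreteness, the following is a minimal sketch of converting an extended dot-bracket string into a binary contact map; the bracket set is chosen for illustration and follows the extended conventions described above.

```python
import numpy as np

# Bracket pairs used in extended dot-bracket notation; '.' marks unpaired.
BRACKETS = {"(": ")", "[": "]", "{": "}", "<": ">"}

def dbn_to_contact_map(dbn: str) -> np.ndarray:
    """Convert a (possibly pseudoknotted) dot-bracket string into a
    symmetric binary N x N contact map."""
    n = len(dbn)
    contacts = np.zeros((n, n), dtype=np.int8)
    stacks = {opener: [] for opener in BRACKETS}      # one stack per bracket type
    closer_to_opener = {v: k for k, v in BRACKETS.items()}
    for j, ch in enumerate(dbn):
        if ch in BRACKETS:                            # opening bracket: remember position
            stacks[ch].append(j)
        elif ch in closer_to_opener:                  # closing bracket: pop its partner
            i = stacks[closer_to_opener[ch]].pop()
            contacts[i, j] = contacts[j, i] = 1
    return contacts

# A small hairpin with a pseudoknot expressed via square brackets.
print(dbn_to_contact_map("((..[[..))..]]"))
```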

Figure 1. Illustrations of ML workflows predicting RNA secondary (A) and tertiary (B, C) structures. For predicting secondary structure (A), an input nucleotide sequence is given to a model that predicts either a contact map or a DBN; both outputs specify a secondary structure with loops and stems, as illustrated in the top right. Commonly used model architectures are sketched in Figure 2. There are two primary approaches for predicting tertiary structure. Scoring methods use an external (typically non-ML) method to generate a candidate structure for an RNA sequence, which is then consumed by an ML model that estimates the difference between the candidate and the unknown true structure (B). Tertiary structure can also be generated directly from a sequence and optional annotations, such as MSAs and secondary structure; these direct generation models output either constraints that are energetically refined to produce a structure, or a 3D structure directly (C).

All these secondary structure representations are amenable to prediction via ML algorithms in the form of binary classification tasks. For per-nucleotide pairing annotations and DBN, this classification task asks whether each nucleotide is paired or unpaired, and optionally whether it is an opening or closing bracket. For pairwise contact map prediction, the classification task asks whether each combinatorial pair of nucleotides is interacting or not. These framings allow for the usage of canonical metrics such as precision, recall, accuracy and F1 score (sometimes referred to as F-score), which combines precision and recall into a single value [53].
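As an illustration, the snippet below computes precision, recall and F1 over two binary contact maps; this is a minimal sketch rather than any particular benchmark's implementation.

```python
import numpy as np

def contact_map_f1(pred: np.ndarray, true: np.ndarray) -> dict:
    """Precision, recall and F1 computed over the upper triangle of two
    binary N x N contact maps, so each base pair is counted once."""
    iu = np.triu_indices_from(true, k=1)
    p, t = pred[iu].astype(bool), true[iu].astype(bool)
    tp = np.sum(p & t)                         # correctly predicted pairs
    precision = tp / max(np.sum(p), 1)         # guard against zero predictions
    recall = tp / max(np.sum(t), 1)            # guard against empty structures
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"precision": precision, "recall": recall, "f1": f1}
```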

RNA tertiary structure describes the full 3D conformation of an RNA molecule and is experimentally obtained using methods like X-ray crystallography, small-angle X-ray scattering, nuclear magnetic resonance and cryo-electron microscopy [32]. Compared with secondary structure profiling techniques, which are frequently built around relatively efficient next-generation sequencing technologies, these methods are time-consuming and low-throughput. Consequently, there are fewer complete 3D structures available for RNAs. The Protein Data Bank (PDB) [52], a constantly growing database of biomolecular 3D structures, contains 1680 RNA-only structures as of December 2022, compared with 173 022 protein-only structures.

RNA tertiary structure is expressed as a series of coordinates describing the position of each part in an RNA structure, typically at atomic resolution. A common metric for evaluating the correctness of a predicted set of 3D coordinates is a root-mean-square deviation (RMSD), which captures the average distance (typically in angstroms, Å) between each atom across two superimposed structures; larger values indicate greater deviation and a poorer match. In addition, TMscore [54] and the local distance difference test (lDDT) [55] are metrics originally proposed for evaluating the structural similarity of two proteins; both have been used to evaluate RNA structural similarity.
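A minimal sketch of computing RMSD after optimal superposition via the Kabsch algorithm follows; it assumes the two structures already share a one-to-one atom correspondence (a simplification relative to full evaluation pipelines). Lower values indicate a better match.

```python
import numpy as np

def rmsd_superimposed(coords_a: np.ndarray, coords_b: np.ndarray) -> float:
    """RMSD (in the units of the inputs, typically angstroms) between two
    (N, 3) coordinate arrays after optimal superposition (Kabsch algorithm)."""
    a = coords_a - coords_a.mean(axis=0)       # center both structures
    b = coords_b - coords_b.mean(axis=0)
    u, _, vt = np.linalg.svd(a.T @ b)          # covariance SVD
    d = np.sign(np.linalg.det(u @ vt))         # correct reflections to a proper rotation
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    return float(np.sqrt(np.mean(np.sum((a @ rot - b) ** 2, axis=1))))
```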

Regardless of the data representation or dataset, a critical consideration in evaluating ML works is the construction of the test dataset. As ML methods, particularly DL models, can memorize and overfit training data, a held-out test set not used for fitting/training model parameters is required for evaluating how well the model might generalize to new, unseen data. The simplest way to create this test set is to randomly partition available data. However, this strategy means that the testing dataset could contain examples extremely similar to the training set examples. Although such similar examples are technically not seen in training, they are nonetheless relatively easy for the model to predict; their presence in the test set can suggest misleadingly strong generalizability. Some works try to mitigate this by removing test set sequences with high nucleotide sequence similarity to any training sequence. Although this is somewhat more rigorous than random splits, this does not guarantee that the test set does not contain easy derivatives of training data, as RNA structure is frequently more strongly conserved than its underlying sequence [56, 57]. The most rigorous test set splits are thus defined using RNA families, where members of that entire family of RNA structures are never seen during training. Since different families of RNA represent structurally distinct groups, this test set construction is the most challenging and most informative evaluation of a structure predictor.
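A hedged sketch of such a family-based split follows; the `records` format (sequence, structure, family) is a hypothetical stand-in for however a given dataset stores its family labels.

```python
import random

def family_split(records, test_frac=0.2, seed=0):
    """Split (sequence, structure, family) records so that no RNA family
    appears in both the training and test sets."""
    families = sorted({fam for _, _, fam in records})
    rng = random.Random(seed)
    rng.shuffle(families)
    n_test = max(1, int(len(families) * test_frac))
    test_families = set(families[:n_test])
    train = [r for r in records if r[2] not in test_families]
    test = [r for r in records if r[2] in test_families]
    return train, test
```

In practice this family-level holdout can be combined with sequence-similarity filtering, since even cross-family splits benefit from removing near-duplicate sequences.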

PREDICTING SECONDARY STRUCTURE

Classical machine learning

RNA secondary structure ML models consume a nucleotide sequence and predict a contact map or DBN specifying a structure (Figure 1A). Many seminal models for RNA secondary structure prediction were built around stochastic context-free grammars (SCFGs, Figure 2A). An SCFG learns the probabilities associated with a set of transitions over a grammar of tokens; these transitions define the likelihood of allowable additions to a sequence, e.g. the probability of adding an A–U pair, the probability of adding a G–C pair, the probability of adding an unpaired A nucleotide, etc. The parse of a sequence whose transitions produce the highest joint probability yields the corresponding most stable predicted structure [58]. Examples of works leveraging SCFGs in predicting RNA secondary structure include KH-99 [59] and Pfold [60] (which builds upon KH-99).
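To make this concrete, the toy sketch below finds the highest-scoring nested structure under a set of pairing log-probabilities. Real SCFG inference uses CYK-style parsing over a full learned grammar; this Nussinov-style recursion is a simplified stand-in, and all probabilities here are invented for illustration.

```python
import numpy as np

# Illustrative (made-up) log-probabilities; a real SCFG learns analogous
# transition/emission probabilities from data and parses with CYK.
LOG_P_PAIR = {("A", "U"): -0.5, ("U", "A"): -0.5, ("C", "G"): -0.2,
              ("G", "C"): -0.2, ("G", "U"): -1.0, ("U", "G"): -1.0}
LOG_P_UNPAIRED = -1.5

def best_parse_score(seq, min_loop=3):
    """Maximum total log-probability over nested secondary structures,
    computed with a Nussinov-style dynamic program."""
    n = len(seq)
    dp = np.zeros((n, n))
    for i in range(n):                               # short spans cannot pair
        for j in range(i, min(i + min_loop + 1, n)):
            dp[i][j] = (j - i + 1) * LOG_P_UNPAIRED
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1]) + LOG_P_UNPAIRED  # i or j unpaired
            pair = LOG_P_PAIR.get((seq[i], seq[j]))
            if pair is not None:                     # (i, j) form a base pair
                best = max(best, dp[i + 1][j - 1] + pair)
            for k in range(i + 1, j - 1):            # bifurcation into substructures
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]

print(best_parse_score("GGGAAAUCCC"))                # hairpin: three G-C pairs
```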

Figure 2. Diagrams of various machine learning approaches predicting RNA secondary structure from sequence. An SCFG (A) specifies a set of transitions (left) with associated probabilities (not shown) that are used to parse an RNA sequence (right) into its highest-likelihood structure. In a recurrent architecture (B), input nucleotides are processed sequentially, taking into account and updating the hidden state $h$ derived from previously seen tokens. The hidden state is used to predict structural properties for each nucleotide, such as whether each nucleotide is paired or unpaired (i.e. $y_i$), using a classifier 'head' $g(h_i)$. In a convolutional architecture (C), a PWM-like convolutional kernel is learned to detect motifs; this kernel is scanned across the input to produce a hidden state capturing where certain motifs occur, optionally followed by additional convolutions (not shown). This hidden state is used to predict secondary structure. Transformer-based networks (D) take a series of discrete input tokens (i.e. nucleotides), embed them into a continuous representation that captures the identity and position of each token and feed these through a series of transformer blocks. Each transformer block applies attention, a mechanism that learns how strongly each pair of tokens is related (right). These transformer blocks produce a per-token embedding that can be used to predict secondary structures.

CONTRAfold [61] improved upon SCFGs using a conditional log-linear model (CLLM), whose formulation allows for the incorporation of additional features beyond those easily expressible in an SCFG grammar, such as the length of internal loops, terminal mismatch, etc. Despite its relative simplicity compared with recent DL models, CONTRAfold remains one of the most generalizable ML methods for predicting RNA secondary structures, frequently outperforming DL models on unseen structural families [62, 63]. EternaFold recently extended CONTRAfold by adding multitask learning objectives [63]. Multitask learning predicts multiple, often related outputs; doing so helps models achieve better generalizability, as they cannot as easily overfit and overspecialize to a single particular target [64].

Another class of classical learning algorithms used to predict RNA secondary structure is structured support vector machines (sSVMs). SVMs conventionally support classification (label or class prediction) and regression (continuous value prediction); sSVMs extend the output space of SVMs to handle complex labels, such as the structured DBN describing RNA secondary structure. MXfold combines a trained sSVM with thermodynamic calculations to predict optimal secondary structures [65]. Other works have proposed sSVMs that predict RNA secondary structure without thermodynamic elements [66]. Compared with SCFGs (and related approaches), sSVM-based works tend to have better computational complexity and consequently run faster, but do not necessarily produce more accurate predictions.

Deep learning

RNA sequences can be viewed as a biological language with a four-nucleotide alphabet: A, C, U and G. Thus, a natural approach is to take successful DL models analyzing human language and apply them to the language of RNAs. One such class of models is recurrent neural networks (RNNs). RNNs maintain a hidden state that is updated as the network encounters each new token (e.g. each nucleotide); this hidden state can be used to make per-token predictions (for example, annotating each word in a sentence as a noun, verb or another part of speech, Figure 2B). Since RNNs consume one token at a time, they have the advantage of being able to handle input sequences of arbitrary length. However, RNNs have been observed to 'forget' information seen long ago as they continually update their hidden state. Long short-term memory (LSTM) networks build upon RNNs to improve the handling of such long-range information [67]; bidirectional LSTMs (bi-LSTMs) further extend these recurrent models by considering the flow of information in both the forward and reverse directions across a sequence. LSTMs and their variants have seen success in NLP [68] and biological sequence analysis [69], and have been applied to RNA secondary structure prediction in several works. DMfold [70] applies a bi-LSTM network to predict, for each nucleotide in an RNA strand, the extended DBN corresponding to that nucleotide. The authors report strong performance on a randomized test set but do not report performance on unseen families of RNAs. RNA-state-inf [71] trains a bi-LSTM to predict whether each nucleotide is paired.
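For concreteness, here is a toy PyTorch sketch of a bi-LSTM per-nucleotide paired/unpaired classifier in the spirit of these methods; layer sizes and depths are illustrative choices, not any published architecture.

```python
import torch
import torch.nn as nn

class BiLSTMPairedClassifier(nn.Module):
    """Toy bi-LSTM predicting a paired/unpaired logit per nucleotide."""
    def __init__(self, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(4, 16)           # A, C, G, U -> 16-dim vectors
        self.lstm = nn.LSTM(16, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)       # per-token classifier "head"

    def forward(self, tokens):                     # tokens: (batch, N) ints in [0, 4)
        h, _ = self.lstm(self.embed(tokens))       # hidden states: (batch, N, 2*hidden)
        return self.head(h).squeeze(-1)            # (batch, N) paired/unpaired logits

model = BiLSTMPairedClassifier()
logits = model(torch.randint(0, 4, (2, 30)))       # two sequences of length 30
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(2, 30))
```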

One downside of modeling RNA as a language across four nucleotides is that each individual nucleotide carries little meaning by itself—after all, secondary structures like stems typically involve a series of base pairing interactions, rather than a single interaction. Analogously, modeling the English language as a sequence over the 26 characters in the alphabet raises similar challenges where each letter carries little meaning. In NLP, this has led to models that consider more meaningful groupings of characters forming common prefixes, suffixes and other linguistic fragments. These are derived using a heuristic frequently known as byte pair encoding [72]. For RNA, however, it is difficult to similarly define meaningful sequence motifs. To address this, some researchers have turned to learning these motifs using convolutional neural networks (CNNs).

CNNs were first proposed in computer vision [73], where they learn sets of convolutional kernels detecting visual patterns (such as edges or textures) within a small patch of pixels in an image. These kernels are frequently stacked to learn higher-order features, like a specific combination of edges and textures that might form an image of a dog. When applied to biological sequences, CNNs' kernels function like learnable position weight matrices [74] that detect (combinations of) sequence motifs in DNA [75, 76], RNA [77, 78] and protein [79] sequences (Figure 2C). Owing to their motif learning and detection capabilities, CNNs provide a potential solution to the aforementioned challenge of modeling RNA using a vocabulary more meaningful than individual nucleotides. SPOT-RNA2 trains an ensemble of CNNs on RNA sequences augmented with evolutionary information [80]. CROSS [81] does not explicitly state the usage of a CNN, but its approach of learning a sliding window that considers the nucleotides preceding and following each position to predict paired and unpaired states follows the same design ethos as a CNN.
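A minimal sketch of this idea follows, with one-hot RNA input and illustrative kernel widths; it is not the architecture of any specific published model.

```python
import torch
import torch.nn as nn

class ConvMotifEncoder(nn.Module):
    """Toy 1D CNN over one-hot RNA: each first-layer kernel acts like a
    learnable PWM scanning for length-7 motifs; stacking a second layer
    captures combinations of motifs over a wider context."""
    def __init__(self, n_kernels=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, n_kernels, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(n_kernels, n_kernels, kernel_size=7, padding=3), nn.ReLU(),
        )
        self.head = nn.Conv1d(n_kernels, 1, kernel_size=1)   # per-base logit

    def forward(self, one_hot):                    # one_hot: (batch, 4, N)
        return self.head(self.conv(one_hot)).squeeze(1)      # (batch, N)

# One-hot encode a random length-50 sequence and run it through the model.
x = torch.zeros(1, 4, 50).scatter_(1, torch.randint(0, 4, (1, 1, 50)), 1.0)
print(ConvMotifEncoder()(x).shape)                 # torch.Size([1, 50])
```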

However, unlike RNNs, CNNs cannot handle inputs of arbitrary length and instead rely on truncating and padding sequences to a uniform size. Given the variability in RNA transcript lengths, many models have been proposed to combine the motif detection capabilities of CNNs with RNNs' ability to handle arbitrary input lengths. RPRes [82] applies a bi-LSTM whose output is fed through a CNN to predict whether each base is paired or unpaired. 2dRNA [83] passes RNA through a bi-LSTM and subsequently a CNN to predict a pairwise contact map. SPOT-RNA [84] and MXfold2 [51] swap the order of these building blocks, first applying a CNN followed by a bi-LSTM.

Other works have leveraged transformer-based architectures for predicting RNA secondary structure. Transformers output an embedding (i.e., a continuous and rich numeric representation) for each token in their input sequence, and are built around the attention mechanism [85], which captures how much each token in the input sequence is related to other tokens (Figure 2D); in a sentence, attention might highlight that the embedding of the pronoun 'they' is heavily influenced by the sentence's subject. Transformer-based models initially found great success in NLP [72], and have since been adapted for domains such as image classification [86] and biological sequence analysis [87]. Transformers' broad success has inspired several works applying this architecture to predicting RNA secondary structure. ATTfold [88] passes an RNA sequence through a transformer to produce per-nucleotide embeddings, which are given to a CNN to predict an N×N base pair scoring matrix. This scoring matrix is constrained and refined to produce a final structure. Like ATTfold, E2Efold [89] also passes an RNA sequence through a transformer and CNN to obtain an N×N contact score matrix. However, E2Efold proposes a novel, differentiable post-processing network to produce a viable structure from these scores. Both ATTfold and E2Efold report high performance on randomly defined test sets, but both may struggle to meaningfully generalize to novel structures. When evaluated on unseen RNA families in the bpRNA-new dataset, E2Efold was found to have an F1 of just 0.036 [62], nearly 20-fold worse than the F1 of 0.686 it achieves on a randomized split within the ArchiveII dataset. The authors of 2dRNA-LD [90] and SPOT-RNA2 [80] similarly found E2Efold to have extremely poor generalization performance. ATTfold does not report cross-family generalization and does not make code or model weights available to facilitate such analyses; however, given its significant methodological similarities to E2Efold, it is likely to share E2Efold's poor generalization.
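A simplified sketch of this transformer-then-pairwise-scoring pattern follows; dimensions are illustrative, positional encodings are omitted for brevity, and the constrained post-processing steps of ATTfold and E2Efold are not shown.

```python
import torch
import torch.nn as nn

class TransformerPairScorer(nn.Module):
    """Toy analogue of the ATTfold/E2Efold pipeline: a transformer encoder
    embeds each nucleotide, then pairwise-combined embeddings are scored
    into an N x N base-pairing matrix."""
    def __init__(self, d=64):
        super().__init__()
        self.embed = nn.Embedding(4, d)            # positional encodings omitted
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pair_head = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, tokens):                     # tokens: (batch, N)
        h = self.encoder(self.embed(tokens))       # (batch, N, d)
        n = h.shape[1]
        hi = h.unsqueeze(2).expand(-1, -1, n, -1)  # embedding of nucleotide i
        hj = h.unsqueeze(1).expand(-1, n, -1, -1)  # embedding of nucleotide j
        scores = self.pair_head(torch.cat([hi, hj], dim=-1)).squeeze(-1)
        return (scores + scores.transpose(1, 2)) / 2   # symmetrize the pair matrix

print(TransformerPairScorer()(torch.randint(0, 4, (1, 20))).shape)  # (1, 20, 20)
```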

The DL works examined thus far all formulate their inputs as a sequence of N RNA nucleotides. Although this is intuitive, several works reformat their inputs as an N×N matrix of nucleotide interactions instead. CNNFold [91], for example, formulates its input as an N×N×8 tensor, where each of the 8 values at the (i, j) cell encodes the specific nucleotide combination from the (i, j) positions as well as simple constraints on allowable secondary structure configurations. CNNFold applies a CNN to this representation to predict a binary N×N matrix describing pairings between nucleotides in the sequence that is post-processed to yield a final structure. UFold [62] adopts a similar approach, whereby the authors take the Kronecker product between a one-hot encoded representation of the RNA sequence and itself, arriving at a similar N×N×16 tensor. Unlike CNNFold, however, UFold’s input does not explicitly encode any constraints on allowable secondary structures, which allows it to be more flexible in predicting non-canonical interactions. Like CNNFold, UFold applies a CNN, specifically a U-Net [92] encoder–decoder architecture, to predict a binary contact map describing pairwise contacts between all nucleotides that is similarly post-processed to obtain a final structure. CDPfold [93] uses heuristics to construct an N×N input to a CNN predicting DBN. Compared with methods that use a sequence of N nucleotides as input, these methods explicitly enumerate all pairwise contacts regardless of the distance between nucleotides. This can potentially improve long-range modeling, which RNN and LSTM variants struggle with, but also means that input sizes grow quadratically with sequence length, greatly increasing computational complexity for longer sequences.
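As an illustration of this input formulation, the sketch below builds a UFold-style N×N×16 tensor from a sequence; the channel ordering is our own choice for illustration and may not match UFold's exact implementation.

```python
import numpy as np

def pairwise_input_tensor(seq: str) -> np.ndarray:
    """Build an N x N x 16 input in the spirit of UFold: channel (a, b)
    at cell (i, j) is 1 iff nucleotide i is base a and nucleotide j is
    base b, i.e. the outer (Kronecker) product of the one-hot sequence
    with itself. Assumes seq contains only A, C, G, U."""
    idx = np.array(["ACGU".index(c) for c in seq])
    one_hot = np.eye(4)[idx]                       # (N, 4) one-hot encoding
    n = len(seq)
    # Outer product over both the sequence and base dimensions.
    tensor = np.einsum("ia,jb->ijab", one_hot, one_hot).reshape(n, n, 16)
    return tensor

print(pairwise_input_tensor("GGGAAACCC").shape)    # (9, 9, 16)
```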

Thus far, we have focused on the usage of different neural network architectures and input formulations for predicting RNA secondary structure. Several works focus instead on training strategies for improving performance. SPOT-RNA applies a CNN followed by an LSTM to an RNA sequence to predict a pairwise contact map [84]. Although it is architecturally similar to other works, SPOT-RNA uniquely leverages transfer learning and ensembling techniques. Transfer learning aims to take advantage of larger, potentially unlabeled datasets to pre-train a general model first, and subsequently tweaks that general model toward a more specific target with less available data, thus improving generalizability. SPOT-RNA first pre-trains on many RNA structures with coarse structural annotations and fine-tunes using a smaller set of RNAs with more precise structures. Ensembling is a technique that trains multiple models for a given task, then averages their predictions together to improve overall performance [94, 95]. Although this consensus approach is effective, it is also computationally expensive, as each model must be separately trained and run.
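As a minimal illustration of ensembling (not SPOT-RNA's actual implementation), the helper below averages the contact probabilities of several independently trained models, assuming each model maps token indices to contact logits.

```python
import torch

@torch.no_grad()
def ensemble_contact_probs(models, tokens):
    """Average sigmoid contact probabilities across independently trained
    models; each model is assumed to return contact logits for `tokens`."""
    probs = [torch.sigmoid(m(tokens)) for m in models]
    return torch.stack(probs).mean(dim=0)
```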

2dRNA-LD [90] also leverages transfer learning but stratifies data by RNA length rather than structural coarseness. The authors reason that RNA molecules of similar length should behave more similarly and thus be easier to learn. They therefore start by training their neural network (bi-LSTM and CNN) on a dataset of RNAs of all lengths, then define five ranges of RNA lengths and fine-tune the model for each length category. The authors show that this transfer learning approach yields a small improvement over the original 2dRNA model [83], which is architecturally identical but used a simpler training procedure.

Although transfer learning and ensembling techniques attempt to leverage available data more intelligently for learning, regularization aims to bias the learning process itself toward signals that our prior knowledge suggests are important [96]. MXfold2 [51] leverages a CNN followed by an LSTM to predict folding scores indicating the likelihood of various pairing interactions but critically regularizes the neural network toward thermodynamic priors. This includes both a free energy term that penalizes predicted structures that are thermodynamically unstable, as well as a loss that penalizes the folding scores themselves for deviating too far from thermodynamic estimations. The authors show that thermodynamic priors meaningfully improve MXfold2’s ability to generalize to unseen RNA families. This is supported by independent analyses performed by the authors of UFold.
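The sketch below conveys this regularization idea in simplified form. MXfold2's actual objective is a structured max-margin loss over folding scores; here we use a hypothetical cross-entropy structure loss plus a penalty tying predicted scores to thermodynamic estimates, with `thermo_scores` standing in for nearest-neighbor parameter estimates.

```python
import torch

def regularized_loss(pred_scores, true_contacts, thermo_scores, lam=0.1):
    """Simplified stand-in for an MXfold2-style objective: a structure
    loss plus a penalty keeping learned folding scores close to
    thermodynamic estimates."""
    structure_loss = torch.nn.functional.binary_cross_entropy_with_logits(
        pred_scores, true_contacts)
    thermo_penalty = torch.mean((pred_scores - thermo_scores) ** 2)
    return structure_loss + lam * thermo_penalty
```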

More recently, there have been efforts integrating evolutionary and structural annotations into models predicting RNA secondary structure. SPOT-RNA2 [80] uses a similar transfer learning and ensembling approach as the original SPOT-RNA method, but notably introduces additional inputs to the model: evolutionary conservation annotations and secondary structure predictions from LinearPartition [97], an energy-based secondary structure annotation method. The authors perform ablation studies showing that these additional inputs greatly boost performance.

Ultimately, these methods represent a diversity of ideas and are excellent examples of exploring what is possible when using data to learn patterns driving RNA secondary structure (Table 2). With these in mind, we would like to point out two major remaining challenges for ML in RNA secondary structure prediction. First is the prediction of non-canonical interactions and pseudoknot structures. Despite these interactions' functional importance, they are relatively difficult to predict with high specificity and sensitivity. Of the 16 works discussed above, only 9 claim to predict pseudoknots; for non-canonical base pairing, most methods can explicitly predict the relatively common G–U wobble pairing [10], but only 6 methods can predict arbitrary non-canonical interactions (Table 2). We hope to see more works that tackle these key structural targets in the future. Second is the consideration that these ML methods have yet to achieve broad generalizability. A recent work comprehensively benchmarking RNA secondary structure prediction methods found that although ML models improve upon traditional thermodynamic or bioinformatic approaches when evaluated under randomized data splits, they are substantially worse than these traditional approaches when evaluated under a family-based data split [56]. In other words, ML models perform well on sequences similar to training examples but do not learn truly general knowledge that extrapolates to structures meaningfully different from training examples. In the broader context of ML, it is extraordinarily difficult to develop truly generalizable models [98], so this limitation for predicting RNA structure is unsurprising. Nonetheless, this lack of generality and robustness of ML predictors of secondary structure may limit their wider adoption. For example, RNA tertiary structure prediction methods, many of which use secondary structure predictions as an intermediary step, frequently choose non-ML approaches for this subtask [23].

Table 2. Summary of machine learning methods for predicting RNA secondary structure, including their basic architectural design, output formats, prediction capabilities on the more challenging pseudoknot and non-canonical pairing schemes, and details regarding model availability. Non-canonical pairing predictions can take the form of allowing G–U wobble pairing predictions in addition to canonical Watson–Crick base pairing predictions ('GU only'), or encompassing arbitrary pairing interactions ('All'). Methods that do not identify a binding partner for each nucleotide (i.e., only predicting paired/unpaired scores per nucleotide) do not describe pseudoknots or non-canonical interactions; corresponding columns are marked as N/A. For model availability, 'code' means that code to train the model is available; 'weights' indicates that trained model weights are available; 'web server' indicates an online portal is available to use the method.

| Method | Output prediction | Predicts pseudoknots | Predicts non-canonical interactions | Model architecture | Code, model weights or web server available |
|---|---|---|---|---|---|
| CONTRAfold [61] | Pairwise contact | No | All | CLLM | Code, weights and web server |
| EternaFold [63] | Pairwise contact | No | All | CLLM | Code, weights and web server |
| DMfold [70] | Extended DBN | Yes | Not claimed or evaluated | bi-LSTM | Code only |
| RNA-state-inf [71] | Binary paired/unpaired | N/A | N/A | bi-LSTM | Code only |
| SPOT-RNA2 [80] | Pairwise contact | Yes | All | CNN | Code, weights and web server |
| CROSS [81] | Binary paired/unpaired | N/A | N/A | CNN-like | Web server only |
| RPRes [82] | Binary paired/unpaired | N/A | N/A | bi-LSTM + CNN | Code only |
| 2dRNA [83] | Pairwise contact | Yes | GU only | bi-LSTM + CNN | Web server only |
| 2dRNA-LD [90] | Pairwise contact | Yes | All | bi-LSTM + CNN | Web server only |
| SPOT-RNA [84] | Pairwise contact | Yes | All | CNN + bi-LSTM | Code, weights and web server |
| MXfold2 [51] | Per-nucleotide folding scores | No | GU only | CNN + bi-LSTM | Code, weights and web server |
| CNNFold [91] | Pairwise contact | Yes | GU only | CNN (N×N input) | Code and weights |
| UFold [62] | Pairwise contact | Yes | All or GU only (configurable post-processing) | CNN (N×N input) | Code, weights and web server |
| CDPfold [93] | DBN | No | GU only | CNN (N×N input) | Code only |
| E2Efold [89] | Pairwise contact | Yes | GU only | Transformer + CNN | Code and weights |
| ATTfold [88] | Pairwise contact | Yes | GU only | Transformer + CNN | No |

CLLM: conditional log-linear model.

PREDICTING TERTIARY STRUCTURE

Although there exists a wide range of ML approaches that predict secondary structures from RNA primary sequence, methods for RNA tertiary structure prediction have traditionally relied on non-learning techniques and heuristics. The most popular, state-of-the-art methods like FARFAR2 [23] assemble secondary structure annotations (produced by non-learning methods) and known structures from evolutionarily similar RNAs to form an initial structural estimate and perform energy minimization to refine the structure. More recently, however, a handful of works have been proposed to tackle this challenge with DL (Table 3). These methods can be divided into approaches that score and rank pre-computed candidate RNA structures (Figure 1B) and works that directly generate the full 3D structural configuration of an RNA (Figure 1C).

Table 3. Summary of machine learning methods predicting RNA tertiary structure, as well as their basic architectural design, prediction format and availability details.

| Method | Output prediction | Model architecture | Code, model or web server available |
|---|---|---|---|
| RNA3DCNN [99] | RMSD relative to the true structure | CNN | Code and weights |
| ARES [100] | RMSD relative to the true structure | Tensor field neural network [101] | Code, weights and web server |
| PaxNet [102] | RMSD relative to the true structure | Message-passing graph neural network | No |
| trRosettaRNA [103] | Pairwise distances/geometries | Transformer | Code, weights and web server |
| DeepFoldRNA [104] | Pairwise distances/geometries | Transformer | Code, weights and web server |
| E2Efold-3d [105] | 3D point cloud | Transformer + invariant point attention | No |
| DRfold [106] | 3D point cloud | Transformer + invariant point attention | Code, weights and web server |
| RoseTTAFoldNA [107] | 3D point cloud | SE(3) transformer | Code and weights |

We first discuss methods that score pre-computed RNA structures relative to an unknown, unseen true structure. These methods rely on classical tools like FARFAR2, Gromacs or Rosetta to produce candidate RNA structures, and train a model to estimate the RMSD of each candidate to the true structure; the lowest-RMSD candidates are reported as the final predicted structures. RNA3DCNN [99] predicts RMSD using a CNN applied to the volumetric 'voxel' neighborhood surrounding each nucleotide. ARES [100] predicts RMSD using a tensor field network [101] that propagates information between local atoms in a spatially invariant fashion. PaxNet [102] leverages a similar message-passing scheme as ARES to train a graph neural network designed to capture local and global energetic interactions. Based on a recent evaluation on RNA-Puzzles competition targets, PaxNet appears to most effectively identify correct structures [102], followed by ARES, with RNA3DCNN exhibiting the poorest performance [100]. These performance deltas appear to be at least partially driven by differences in architectural design. Both ARES and PaxNet use network architectures that understand that rotational and translational shifts in atomic coordinates do not change the fundamental 3D geometry of the problem, whereas RNA3DCNN uses a less geometrically aware convolutional operation, which may explain its relatively poor performance. Compared with ARES, PaxNet utilizes more biophysical reasoning in its network design, which may help it achieve better performance. However, it is important to note that ARES and PaxNet sample more than three times as many candidate structures per experimental structure during training (1000 for ARES and PaxNet, compared with 300 for RNA3DCNN). More exhaustive sampling exposes ARES and PaxNet to more poor structures during training, which may improve their recognition of incorrect patterns in new structures irrespective of architectural differences.
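A minimal sketch of this score-and-select workflow follows, where `rmsd_predictor` is a hypothetical stand-in for a trained scorer such as ARES or PaxNet, and `candidates` for structures generated by a tool like FARFAR2.

```python
def select_best_candidate(candidates, rmsd_predictor):
    """Rank pre-generated candidate structures by a learned RMSD estimate
    and return the predicted-best one (lower predicted RMSD is better)."""
    scored = [(rmsd_predictor(c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0])
    return scored[0][1]
```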

These scoring methods take advantage of existing non-learning RNA folding methods to reframe and simplify the task of 3D structural prediction. Rather than trying to directly generate coordinates that form valid structures, a task with many degrees of freedom, they select the best structure among a set of (presumably reasonable) candidates. However, their scoring and filtering approach has several downsides. Not only is the computational cost of generating hundreds to thousands of structures for a single prediction daunting, but these methods’ dependence on pre-generated structures means that they are fundamentally limited by the quality and accuracy of these generated structures. Indeed, this limitation may drive the need for thousands of candidate structures to be generated per RNA in the first place.

More recently, several DL models have been proposed that go beyond scoring precomputed structures and predict RNA structures themselves. These works draw significant inspiration from recent ideas incorporated in protein structure folding methods like AlphaFold2 [108], RoseTTAFold [107] and trRosetta [109]—particularly the incorporation of evolutionary information in the form of multiple sequence alignments (MSAs) as model inputs—and can be broadly understood as leveraging either of two general strategies: methods that output a set of geometric constraints that are subsequently used to create a 3D structure, and methods that directly output a 3D structure. Works that output sets of geometric constraints include trRosettaRNA [103] and DeepFoldRNA [104]. Given an input RNA sequence, both methods augment it with MSA and secondary structure annotations; the MSA and secondary structure are then consumed by a transformer neural network to predict geometric constraints used to assemble a full 3D structure using external tools. Both methods also use self-distillation to expand available training data beyond experimentally profiled RNA structures. In short, self-distillation uses a partially trained model to annotate unlabeled data, and then uses these imputed labels for additional training; although the imputed labels may contain noise, they are nonetheless helpful for reducing overfitting. These methods primarily differ in the specifics of secondary structure annotation and of 3D structure assembly from constraints. DeepFoldRNA leverages PETFold [22], a non-learning method based on evolutionary and thermodynamic principles, for secondary structure annotation, whereas trRosettaRNA uses SPOT-RNA, a previously discussed DL model. DeepFoldRNA's structural assembly starts with coarse-grained model generation, whereas trRosettaRNA directly optimizes full-atom models. Despite these slight differences, both methods exhibit very similar performance on RNA-Puzzles targets [103].

End-to-end methods that directly produce 3D RNA structures include E2Efold-3d [105], DRfold [106] and RoseTTAFoldNA. E2Efold-3d takes an RNA sequence as input and follows a similar design as AlphaFold2, including the usage of input MSAs, invariant point attention mechanisms to apply attention in a geometrically aware fashion and recycling iterations to refine predicted structures. DRfold notably does not incorporate MSAs but includes secondary structure annotations. The sequence and secondary structure are passed through a similar transformer network as that used in E2Efold-3d, as well as a second transformer to predict inter-residue geometries. The predicted structures and geometries are then integrated and jointly optimized to produce a final structure. RoseTTAFoldNA extends the RoseTTAFold [107] model architecture and differs from E2Efold-3d and DRfold in its ability to predict not just RNA tertiary structures, but general protein–nucleotide complexes for both DNA and RNA. This makes RoseTTAFoldNA the only model to our knowledge that can predict detailed RNA quaternary structure (and not just the presence of a quaternary interaction). Not only is this generality impressive, but it also means that during training, RoseTTAFoldNA learns not just from RNA structures, but from a much richer and larger set including RNA-only complexes, RNA/DNA–protein complexes and protein-only complexes. This expanded training set likely helps improve performance, particularly given the relatively low quantity of RNA-only structures available.

These non-scoring tertiary structure works generally report predicted structure accuracies surpassing that of traditional methods like FARFAR2, though some report reduced performance when there are fewer MSAs or less reliable secondary structure estimates available (which itself may be correlated with lower homology to known RNA sequences). These strides in structure generation are thus highly complementary to the scoring improvements proposed by methods like PaxNet and ARES. Non-scoring models’ improved predictive accuracy can provide downstream scoring methods with a higher-quality input set, drastically increasing the odds of identifying the correct structure. Furthermore, DL candidate structures could be pooled with structures generated by traditional methods like FARFAR2 that may be less susceptible to performance drops on ‘out-of-distribution’ examples; a scoring method could then identify the best structure regardless of the method of origin. This could create a structure prediction pipeline that leverages both the accuracy of end-to-end DL tertiary structure prediction models and the generalizability of traditional methods like FARFAR2.

DISCUSSION AND FUTURE DIRECTIONS

Predicting RNA structures from RNA sequences remains an extremely challenging problem for ML methods. Although several ML methods claim strong performance on a held-out test set, outperforming non-learning techniques, these evaluations are frequently performed using randomized data splits. When evaluated on their ability to truly generalize to unseen, structurally distinct RNA families (not just unseen sequences), ML methods perform similarly to, and indeed frequently worse than, traditional non-learning techniques and human experts [110]. This problem is further exacerbated by the fact that RNA structure databases overrepresent commonly studied systems, and thus bias trained models toward these particular structures [111]. Taken together, these observations make cross-family generalization one of the most pressing challenges facing the broader field of machine learning for RNA structure prediction. To this end, we would like to emphasize the importance of standardized benchmarking of RNA structure prediction methods on clusters structurally distinct from the training data. Future work establishing standardized benchmark sets constructed in this manner, perhaps in a cluster-based k-fold cross-validation fashion, would be key to cementing this more rigorous benchmarking approach as a consistent, reliable quantification of modeling improvements.
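As a sketch of what such a benchmark split could look like in practice, scikit-learn's GroupKFold can hold out entire structural families at a time, assuming each RNA has been assigned a family or cluster label (e.g. via Rfam membership or structural clustering); the data variables here are placeholders.

```python
from sklearn.model_selection import GroupKFold

def family_aware_folds(sequences, structures, family_labels, k=5):
    """Yield train/test index splits in which no structural family appears
    on both sides, so held-out performance measures cross-family
    generalization rather than memorization of similar sequences."""
    gkf = GroupKFold(n_splits=k)
    yield from gkf.split(sequences, structures, groups=family_labels)
```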

Beyond the need for consistent, rigorous evaluation, we can examine the strongest and weakest modeling approaches to identify common threads and strategies for future work. Among ML approaches for RNA secondary structure prediction, CONTRAfold (and its derivative, EternaFold) and MXfold2 appear to have the strongest, most consistent cross-family generalization [56, 62]. This suggests an advantage both for simple models with relatively few learnable parameters (in the case of CONTRAfold) and for models that leverage thermodynamic priors to penalize solutions deviating too greatly from prior knowledge (in the case of MXfold2). On the other hand, works at the opposite end of the complexity spectrum, which train extremely large models with little to no priors imposed by regularization or architectural design (as in the case of E2Efold), appear to generalize particularly poorly. Somewhat surprisingly, no particular DL architecture, be it LSTM, CNN or transformer, appears to fundamentally enable improved performance across the board. These broad trends favoring energetic priors are echoed among DL models for RNA tertiary structure prediction: among structure-scoring models, the most accurate estimator, PaxNet, uses an architecture that is not only invariant to the geometry of the problem but is also designed to capture different levels of energetic interactions in the overall structure.
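To illustrate what such a thermodynamic prior might look like, the sketch below adds a penalty pulling a network's folding score toward a nearest-neighbor (Turner) free-energy estimate. This is a simplified construction in the spirit of MXfold2's thermodynamic regularization, not its exact objective; all arguments are assumed to be computed elsewhere.

```python
def thermo_regularized_loss(margin_loss, learned_score, turner_score, lam=0.1):
    # Penalize learned folding scores that stray far from the thermodynamic
    # estimate, discouraging solutions that contradict established
    # energetics; `lam` controls the strength of the prior.
    return margin_loss + lam * (learned_score - turner_score) ** 2
```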

These observations point to promising directions for future work. Recent works in the broader ML literature have similarly found that incorporating thermodynamic principles can greatly improve the performance and generalizability of DL models for thermodynamically driven problems [112, 113]. Going forward, works incorporating the biophysics and energetics of RNA folding into ML models are likely to greatly improve performance and generalizability. Furthermore, as illustrated by CONTRAfold's enduring success, simple models may hold the key to robustly predicting RNA secondary structures. As one of CONTRAfold's primary drawbacks is its lengthy runtime, particularly for longer RNA sequences, future work could focus on implementing similarly simple yet effective models with better runtime complexity.

Beyond the exploration of simpler models and the integration of thermodynamic priors, several works have pursued additional avenues that merit further exploration. SPOT-RNA and SPOT-RNA2 leverage pre-training on RNA structures of varying coarseness to improve secondary structure predictive performance. Future work could expand pre-training and similar strategies, such as self-distillation, to broaden training sets, perhaps even including non-RNA molecules as was successfully done by RoseTTAFoldNA. Several works in protein structure modeling have found that transformers pre-trained on large sets of amino acid sequences without structures nonetheless learn coarse structural contact information [114–116]. We anticipate that broad pre-training on RNA sequences could similarly capture implicit structural signals. Transformer models like ATTfold and E2Efold may particularly benefit from incorporating pre-training strategies; both models were trained from scratch using only RNAs with known structures and appear to generalize particularly poorly, whereas transformer models for other biological sequences are typically pre-trained on large datasets to avoid similar issues [87, 114, 117, 118].
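A hedged sketch of what such pre-training could look like for RNA is shown below: a standard masked-language-model step in PyTorch, where random nucleotides are masked and the model is trained to recover them. The `encoder` and `head` modules are generic placeholders (e.g. a transformer stack and a linear projection), not components of any published RNA model.

```python
import torch
import torch.nn.functional as F

VOCAB = {"A": 0, "C": 1, "G": 2, "U": 3, "[MASK]": 4}

def mlm_loss(encoder, head, seq_tokens, mask_rate=0.15):
    """One masked-language-model pre-training step on an RNA sequence
    encoded as a 1D tensor of token IDs."""
    corrupted = seq_tokens.clone()
    mask = torch.rand(seq_tokens.shape) < mask_rate
    corrupted[mask] = VOCAB["[MASK]"]
    hidden = encoder(corrupted)   # contextual embeddings, shape (L, d)
    logits = head(hidden)         # per-position logits, shape (L, |VOCAB|)
    # The loss is computed only at the masked positions.
    return F.cross_entropy(logits[mask], seq_tokens[mask])
```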

Another promising direction is the increased integration of evolutionary information and auxiliary secondary structure estimates. Although most tertiary RNA structure prediction methods successfully leverage one or both of these data modalities, secondary structure prediction models do so far less often, with only SPOT-RNA2 successfully integrating evolutionary annotations. Although it might seem counterintuitive to give a secondary structure estimator an input that includes energetically or evolutionarily estimated secondary structures, such an input may greatly improve these models' generalizability. One might imagine a model that recognizes when an input sequence may form a structure too different from the training data to be effectively predicted with learning-based methods, and in such cases leans more heavily on thermodynamic predictions. This may help prevent catastrophic failure on completely novel structures while retaining improved performance on structures similar to training examples.
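A minimal version of this guard could be expressed as below, assuming a hypothetical `novelty_score` that measures how dissimilar an input is from the training distribution (e.g. distance to the nearest training family); both predictor callables are placeholders.

```python
def guarded_prediction(seq, dl_model, thermo_model, novelty_score,
                       threshold=0.5):
    # When the input looks out-of-distribution, fall back on the
    # physics-based predictor; otherwise trust the learned model.
    if novelty_score(seq) > threshold:
        return thermo_model(seq)
    return dl_model(seq)
```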

Overall, ML for RNA structure prediction is a fast-moving field. Although numerous excellent works have explored a wide variety of ideas, these ideas have yet to consistently and demonstrably outperform traditional approaches. Nevertheless, despite the challenging nature of this field, the ability to accurately predict RNA structures could hold the key to greater biological understanding, as well as to the design of novel RNA-based therapeutics. In this review, we have surveyed a wide range of these works and provided perspective on the ideas that appear most promising going forward. We hope that our work can serve as a useful reference to guide researchers in the field.

Key Points

RNA molecules play a plethora of critical roles in the cell, and understanding their structure is critical to better understanding these functions.

Experimentally determining RNA structure, particularly tertiary structure, is often time-consuming. This has motivated a large body of research on computationally predicting secondary and tertiary structure from RNA primary sequence.

We provide a review of ML works predicting both secondary and tertiary RNA structures.

We identify trends in methodologies that appear to be promising directions for improving the performance of these models going forward.

FUNDING

This research is supported by funding from the Chan-Zuckerberg Biohub.

DATA AVAILABILITY

This study does not produce or analyze new data.

REFERENCES

1. Hirose T, Mishima Y, Tomari Y. Elements and machinery of non-coding RNAs: toward their taxonomy. EMBO Rep 2014;15(5):489–507.
2. Fricke M, Gerst R, Ibrahim B, et al. Global importance of RNA secondary structures in protein-coding sequences. Bioinformatics 2019;35(4):579–83.
3. Mauger DM, Cabral BJ, Presnyak V, et al. mRNA structure regulates protein expression through changes in functional half-life. Proc Natl Acad Sci U S A 2019;116(48):24075–83.
4. Yang X, Yang M, Deng H, Ding Y. New era of studying RNA secondary structure and its influence on gene regulation in plants. Front Plant Sci 2018;9:671.
5. Vandivier LE, Anderson SJ, Foley SW, Gregory BD. The conservation and function of RNA secondary structure in plants. Annu Rev Plant Biol 2016;67:463–88.
6. Brown PH, Tiley LS, Cullen BR. Effect of RNA secondary structure on polyadenylation site selection. Genes Dev 1991;5(7):1277–84.
7. Sanchez de Groot N, Armaos A, Graña-Montes R, et al. RNA structure drives interaction with proteins. Nat Commun 2019;10(1):3246.
8. Brierley I, Pennell S, Gilbert RJ. Viral RNA pseudoknots: versatile motifs in gene expression and replication. Nat Rev Microbiol 2007;5(8):598–610.
9. Olson WK, Li S, Kaukonen T, et al. Effects of noncanonical base pairing on RNA folding: structural context and spatial arrangements of G·A pairs. Biochemistry 2019;58(20):2474–87.
10. Varani G, McClain WH. The G x U wobble base pair. A fundamental building block of RNA structure crucial to RNA function in diverse biological systems. EMBO Rep 2000;1(1):18–23.
11. Lemieux S, Major F. RNA canonical and non-canonical base pairing types: a recognition method and complete repertoire. Nucleic Acids Res 2002;30(19):4250–63.
12. Staple DW, Butcher SE. Pseudoknots: RNA structures with diverse functions. PLoS Biol 2005;3(6):e213.
13. Hajdin CE, Bellaousov S, Huggins W, et al. Accurate SHAPE-directed RNA secondary structure modeling, including pseudoknots. Proc Natl Acad Sci U S A 2013;110(14):5498–503.
14. Chen Y, Varani G. RNA structure. eLS 2010.
15. Jain S, Richardson DC, Richardson JS. Computational methods for RNA structure validation and improvement. Methods Enzymol 2015;558:181–212.
16. Gruber AR, Lorenz R, Bernhart SH, et al. The Vienna RNA websuite. Nucleic Acids Res 2008;36:W70–4.
17. Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 2003;31(13):3406–15.
18. Huang L, Zhang H, Deng D, et al. LinearFold: linear-time approximate RNA folding by 5′-to-3′ dynamic programming and beam search. Bioinformatics 2019;35(14):i295–304.
19. Tan Z, Fu Y, Sharma G, Mathews DH. TurboFold II: RNA structural alignment and secondary structure prediction informed by multiple homologs. Nucleic Acids Res 2017;45(20):11570–81.
20. Sato K, Kato Y, Hamada M, et al. IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics 2011;27(13):i85–93.
21. Kiryu H, Kin T, Asai K. Robust prediction of consensus secondary structures using averaged base pairing probability matrices. Bioinformatics 2007;23(4):434–41.
22. Seemann SE, Gorodkin J, Backofen R. Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Res 2008;36(20):6355–62.
23. Watkins AM, Rangan R, Das R. FARFAR2: improved de novo Rosetta prediction of complex global RNA folds. Structure 2020;28(8):963–976.e6.
24. Janiesch C, Zschech P, Heinrich K. Machine learning and deep learning. Electronic Markets 2021;31(3):685–95.
25. Lorenz R, Wolfinger MT, Tanzer A, Hofacker IL. Predicting RNA secondary structures from sequence and probing data. Methods 2016;103:86–98.
26. Schroeder SJ. Advances in RNA structure prediction from sequence: new tools for generating hypotheses about viral RNA structure-function relationships. J Virol 2009;83(13):6326–34.
27. Mathews DH. Revolutions in RNA secondary structure prediction. J Mol Biol 2006;359(3):526–32.
28. Seetin MG, Mathews DH. RNA structure prediction: an overview of methods. Methods Mol Biol 2012;905:99–122.
29. Wei J, Chen S, Zong L, et al. Protein-RNA interaction prediction with deep learning: structure matters. Brief Bioinform 2022;23(1):bbab540.
30. Torng W, Altman RB. High precision protein functional site detection using 3D convolutional neural networks. Bioinformatics 2019;35(9):1503–12.
31. Xia Y, Xia CQ, Pan X, Shen HB. GraphBind: protein structural context embedded rules learned by hierarchical graph neural networks for recognizing nucleic-acid-binding residues. Nucleic Acids Res 2021;49(9):e51.
32. Zhang J, Fei Y, Sun L, Zhang QC. Advances and opportunities in RNA structure experimental determination and computational modeling. Nat Methods 2022;19(10):1193–207.
33. Zubradt M, Gupta P, Persad S, et al. DMS-MaPseq for genome-wide or targeted RNA structure probing in vivo. Nat Methods 2017;14(1):75–82.
34. Spitale RC, Flynn RA, Zhang QC, et al. Structural imprints in vivo decode RNA regulatory mechanisms. Nature 2015;519(7544):486–90.
35. Rouskin S, Zubradt M, Washietl S, et al. Genome-wide probing of RNA structure reveals active unfolding of mRNA structures in vivo. Nature 2014;505(7485):701–5.
36. Ding Y, Tang Y, Kwok CK, et al. In vivo genome-wide profiling of RNA secondary structure reveals novel regulatory features. Nature 2014;505(7485):696–700.
37. Lorenz R, Bernhart SH, Höner zu Siederdissen C, et al. ViennaRNA package 2.0. Algorithms Mol Biol 2011;6:26.
38. Zarringhalam K, Meyer MM, Dotu I, et al. Integrating chemical footprinting data into RNA secondary structure prediction. PLoS One 2012;7(10):e45160.
39. Tomezsko PJ, Corbin VDA, Gupta P, et al. Determination of RNA structural diversity and its role in HIV-1 RNA splicing. Nature 2020;582(7812):438–42.
40. Lu Z, Zhang QC, Lee B, et al. RNA duplex map in living cells reveals higher-order transcriptome structure. Cell 2016;165(5):1267–79.
41. Aw JG, Shen Y, Wilm A, et al. In vivo mapping of eukaryotic RNA interactomes reveals principles of higher-order organization and regulation. Mol Cell 2016;62(4):603–17.
42. Ziv O, Gabryelska MM, Lun ATL, et al. COMRADES determines in vivo RNA structures and interactions. Nat Methods 2018;15(10):785–8.
43. Van Damme R, Li K, Zhang M, et al. Chemical reversible crosslinking enables measurement of RNA 3D distances and alternative conformations in cells. Nat Commun 2022;13(1):911.
44. Sugimoto Y, Vigilante A, Darbo E, et al. hiCLIP reveals the in vivo atlas of mRNA secondary structures recognized by Staufen 1. Nature 2015;519(7544):491–4.
45. Ramani V, Qiu R, Shendure J. High-throughput determination of RNA structure by proximity ligation. Nat Biotechnol 2015;33(9):980–4.
46. Cannone JJ, Subramanian S, Schnare MN, et al. The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics 2002;3:2.
47. Griffiths-Jones S, Bateman A, Marshall M, et al. Rfam: an RNA family database. Nucleic Acids Res 2003;31(1):439–41.
48. Rivas E, Clements J, Eddy SR. A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs. Nat Methods 2017;14(1):45–8.
49. Danaee P, Rouches M, Wiley M, et al. bpRNA: large-scale automated annotation and analysis of RNA secondary structure. Nucleic Acids Res 2018;46(11):5381–94.
50. Sloma MF, Mathews DH. Exact calculation of loop formation probability identifies folding motifs in RNA secondary structures. RNA 2016;22(12):1808–18.
51. Sato K, Akiyama M, Sakakibara Y. RNA secondary structure prediction using deep learning with thermodynamic integration. Nat Commun 2021;12(1):941.
52. Rose PW, Prlić A, Altunkaya A, et al. The RCSB protein data bank: integrative view of protein, gene and 3D structural information. Nucleic Acids Res 2017;45(D1):D271–81.
53. Mathews DH. How to benchmark RNA secondary structure prediction accuracy. Methods 2019;162-163:60–7.
54. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins 2004;57(4):702–10.
55. Mariani V, Biasini M, Barbato A, Schwede T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 2013;29(21):2722–8.
56. Szikszai M, Wise M, Datta A, et al. Deep learning models for RNA secondary structure prediction (probably) do not generalize across families. Bioinformatics 2022;38(16):3892–9.
57. Seemann SE, Mirza AH, Bang-Berthelsen CH, et al. Does rapid sequence divergence preclude RNA structure conservation in vertebrates? Nucleic Acids Res 2022;50(5):2452–63.
58. Sükösd Z, Andersen ES, Lyngsø R. SCFGs in RNA secondary structure prediction. In: RNA Secondary Structure Prediction: A Hands-On Approach. Methods Mol Biol 2014;1097:143–62.
59. Knudsen B, Hein J. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 1999;15(6):446–54.
60. Knudsen B, Hein J. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res 2003;31(13):3423–8.
61. Do CB, Woods DA, Batzoglou S. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 2006;22(14):e90–8.
62. Fu L, Cao Y, Wu J, et al. UFold: fast and accurate RNA secondary structure prediction with deep learning. Nucleic Acids Res 2022;50(3):e14.
63. Wayment-Steele HK, Kladwang W, Strom AI, et al. RNA secondary structure packages evaluated and improved by high-throughput experiments. Nat Methods 2022;19(10):1234–42.
64. Zhang Y, Yang Q. A survey on multi-task learning. IEEE Trans Knowl Data Eng 2021;34(12):5586–609.
65. Akiyama M, Sato K, Sakakibara Y. A max-margin training of RNA secondary structure prediction integrated with the thermodynamic model. J Bioinform Comput Biol 2018;16(6):1840025.
66. Rezaur Rahman Chowdhury FA, Zhang H, Huang L. Learning to fold RNAs in linear time. bioRxiv 2019;852871.
67. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9(8):1735–80.
68. Ghosh S, et al. Contextual LSTM (CLSTM) models for large scale NLP tasks. arXiv preprint arXiv:1602.06291, 2016.
69. Hie B, Zhong ED, Berger B, Bryson B. Learning the language of viral evolution and escape. Science 2021;371(6526):284–8.
70. Wang L, Liu Y, Zhong X, et al. DMfold: a novel method to predict RNA secondary structure with pseudoknots based on deep learning and improved base pair maximization principle. Front Genet 2019;10:143.
71. Willmott D, Murrugarra D, Ye Q. Improving RNA secondary structure prediction via state inference with deep recurrent neural networks. Comput Math Biophys 2020;8(1):36–50.
72. Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. Adv Neural Inform Process Syst 2020;33:1877–901.
73. O’Shea K, Nash R. An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458, 2015.
74. Staden R. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res 1984;12(1 Pt 2):505–19.
75. Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res 2016;26(7):990–9.
76. Kelley DR, Reshef YA, Bileschi M, et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res 2018;28(5):739–50.
77. Zhang J, Liu B, Wang Z, et al. DeepPN: a deep parallel neural network based on convolutional neural network and graph convolutional network for predicting RNA-protein binding sites. BMC Bioinformatics 2022;23(1):257.
78. Georgakilas GK, Grioni A, Liakos KG, et al. Multi-branch convolutional neural network for identification of small non-coding RNA genomic loci. Sci Rep 2020;10(1):9486.
79. Yang KK, Lu AX, Fusi N. Convolutions are competitive with transformers for protein sequence pretraining. bioRxiv 2022;2022.05.19.492714.
80. Singh J, Paliwal K, Zhang T, et al. Improved RNA secondary structure and tertiary base-pairing prediction using evolutionary profile, mutational coupling and two-dimensional transfer learning. Bioinformatics 2021;37(17):2589–600.
81. Delli Ponti R, Marti S, Armaos A, Tartaglia GG. A high-throughput approach to profile RNA structure. Nucleic Acids Res 2017;45(5):e35.
82. Wang L, Zhong X, Wang S, et al. A novel end-to-end method to predict RNA secondary structure profile based on bidirectional LSTM and residual neural network. BMC Bioinformatics 2021;22(1):169.
83. Mao K, Wang J, Xiao Y. Prediction of RNA secondary structure with pseudoknots using coupled deep neural networks. Biophys Rep 2020;6(4):146–54.
84. Singh J, Hanson J, Paliwal K, Zhou Y. RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nat Commun 2019;10(1):5407.
85. Vaswani A, Shazeer NM, Parmar N, et al. Attention is all you need. Adv Neural Inform Process Syst 2017;30.
86. Dosovitskiy A, et al. An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations, 2021.
87. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 2021;37(15):2112–20.
88. Wang Y, Liu Y, Wang S, et al. ATTfold: RNA secondary structure prediction with pseudoknots based on attention mechanism. Front Genet 2020;11:612086.
89. Chen X, Li Y, Umarov R, et al. RNA secondary structure prediction by learning unrolled algorithms. In: International Conference on Learning Representations, 2020.
90. Mao K, Wang J, Xiao Y. Length-dependent deep learning model for RNA secondary structure prediction. Molecules 2022;27(3):1030.
91. Saman Booy M, Ilin A, Orponen P. RNA secondary structure prediction with convolutional neural networks. BMC Bioinformatics 2022;23(1):58.
92. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015. Cham: Springer, 2015, 234–41.
93. Zhang H, Zhang C, Li Z, et al. A new method of RNA secondary structure prediction based on convolutional neural network and dynamic programming. Front Genet 2019;10:467.
94. Cao Y, Geddes TA, Yang JYH, Yang P. Ensemble deep learning in bioinformatics. Nat Mach Intell 2020;2(9):500–8.
95. Ganaie MA, Hu M, Malik AK, Tanveer M, Suganthan PN. Ensemble deep learning: a review. Engineering Applications of Artificial Intelligence 2022;115:105151.
96. Tian Y, Zhang Y. A comprehensive survey on regularization strategies in machine learning. Information Fusion 2022;80:146–66.
97. Zhang H, Zhang L, Mathews DH, Huang L. LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities. Bioinformatics 2020;36(Suppl_1):i258–67.
98. Recht B, Roelofs R, Schmidt L, Shankar V. Do ImageNet classifiers generalize to ImageNet? In: Proceedings of the 36th International Conference on Machine Learning, Long Beach, California. PMLR 2019;97:5389–400.
99. Li J, Zhu W, Wang J, et al. RNA3DCNN: local and global quality assessments of RNA 3D structures using 3D deep convolutional neural networks. PLoS Comput Biol 2018;14(11):e1006514.
100. Townshend RJL, Eismann S, Watkins AM, et al. Geometric deep learning of RNA structure. Science 2021;373(6558):1047–51.
101. Thomas N, et al. Tensor field networks: rotation- and translation-equivariant neural networks for 3D point clouds. arXiv preprint arXiv:1802.08219, 2018.
102. Zhang S, Liu Y, Xie L. Physics-aware graph neural network for accurate RNA 3D structure prediction. arXiv preprint arXiv:2210.16392, 2022.
103. Feng C, et al. Accurate de novo prediction of RNA 3D structure with transformer network. bioRxiv 2022;2022.10.24.513506.
104. Pearce R, Omenn GS, Zhang Y. De novo RNA tertiary structure prediction at atomic resolution using geometric potentials from deep learning. bioRxiv 2022;2022.05.15.491755.
105. Shen T, et al. E2Efold-3D: end-to-end deep learning method for accurate de novo RNA 3D structure prediction. arXiv preprint arXiv:2207.01586, 2022.
106. Li Y, Zhang C, Feng C, et al. Integrating end-to-end learning with deep geometrical potentials for ab initio RNA structure prediction. bioRxiv 2022;2022.12.30.522296.
107. Baek M, DiMaio F, Anishchenko I, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021;373(6557):871–6.
108. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596(7873):583–9.
109. Yang J, Anishchenko I, Park H, et al. Improved protein structure prediction using predicted interresidue orientations. Proc Natl Acad Sci U S A 2020;117(3):1496–503.
110. Miao Z, Adamiak RW, Antczak M, et al. RNA-Puzzles round IV: 3D structure predictions of four ribozymes and two aptamers. RNA 2020;26(8):982–95.
111. Flamm C, Wielach J, Wolfinger MT, et al. Caveats to deep learning approaches to RNA secondary structure prediction. Front Bioinform 2022;2:835422.
112. Hernandez Q, Badías A, González D, et al. Deep learning of thermodynamics-aware reduced-order models from data. Comput Methods Appl Mech Eng 2021;379:113763.
113. Karniadakis GE, Kevrekidis IG, Lu L, et al. Physics-informed machine learning. Nat Rev Phys 2021;3(6):422–40.
114. Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023;379(6637):1123–30.
115. Vig J, Madani A, Varshney LR, et al. BERTology meets biology: interpreting attention in protein language models. In: International Conference on Learning Representations, 2021.
116. Rao R, Meier J, Sercu T, et al. Transformer protein language models are unsupervised structure learners. In: International Conference on Learning Representations, 2021.
117. Novakovsky G, Saraswat M, Fornes O, et al. Biologically relevant transfer learning improves transcription factor binding prediction. Genome Biol 2021;22(1):280.
118. Wu K, et al. TCR-BERT: learning the grammar of T-cell receptors for flexible antigen-binding analyses. bioRxiv 2021;2021.11.18.469186.

Author notes

James Y. Zou and Howard Chang are co-corresponding authors.

© The Author(s) 2023. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/pages/standard-publication-reuse-rights).

